[MXNET-422] Distributed training tutorial #10955

indhub · 2018-05-15T19:34:35Z

Description

Distributed training tutorial

Note: Images might not be visible until dmlc/web-data#67 is merged.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Code is well-documented:
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html

ThomasDelteil

Great tutorial again @indhub! Much needed, we had very little on distributed training with Gluon. What would be really awesome is training on CIFAR10 and including a graph of validation_accuracy / time for 1 host 4 GPUs and 2 hosts 8 GPUs, to show the improvement in performance.

ThomasDelteil · 2018-05-16T18:29:24Z

docs/tutorials/index.md

@@ -38,6 +38,7 @@ Select API:&nbsp;
    * [Visual Question Answering](http://gluon.mxnet.io/chapter08_computer-vision/visual-question-answer.html) <img src="https://upload.wikimedia.org/wikipedia/commons/6/6a/External_link_font_awesome.svg" alt="External link" height="15px" style="margin: 0px 0px 3px 3px;"/>
 * Practitioner Guides
    * [Multi-GPU training](http://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) <img src="https://upload.wikimedia.org/wikipedia/commons/6/6a/External_link_font_awesome.svg" alt="External link" height="15px" style="margin: 0px 0px 3px 3px;"/>
+    * [Distributed Training](https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training)


Not a big fan of linking to the GitHub repo. I understand the tutorial is not runnable, but I would prefer it to be hosted in tutorials/docs/gluon to avoid fragmentation. Add the right import statements, and put in

runnable:

```python
store = mxnet.kv.create('dist')
```

not meant to be run:

```
for batch in train_data:
    train_batch(batch, ctx, net, trainer)
```

That way the code you are using is tested against the CI for correctness.

I don't think the CI system can currently test distributed training. I think you are trying to say that part of the code can be tested. Note that it will only test very trivial parts, and the Python code that is not tested (which is the majority of the code) will lose syntax highlighting, which will spoil the user experience.

ThomasDelteil · 2018-05-16T18:30:02Z

example/distributed_training/README.md

@@ -0,0 +1,231 @@
+# Distributed Training using Gluon


Please move this to docs/tutorials and change this README.md to refer to that file instead

If I move this to docs/tutorials, I can't make it pass the tutorials tests without putting most of the code under "```" instead of "```python". This will remove syntax highlighting, and the result is a bad user experience. If there is a way to whitelist this tutorial from the automated tests, please let me know.

I only see this block not being able to run in a notebook?

+```python
+for batch in train_data:
+    # Train the batch using multiple GPUs
+    train_batch(batch, ctx, net, trainer)
+```

Few more blocks:

store = mxnet.kv.create('dist')
won’t work without a bunch of environment variables being defined. launch.py defines these variables.

trainer = gluon.Trainer(net.collect_params(),
won’t work because net is not defined

print("Total number of workers: %d" % store.num_workers)
won’t work because store cannot be created. See above.
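To make the point about environment variables concrete, here is a minimal sketch of a pre-flight check a script could run before calling `kv.create('dist')`. The variable names listed are an assumption based on the `DMLC_*` variables that, to my understanding, `launch.py` exports for each process; the helper `missing_dist_env` is hypothetical and not part of MXNet.

```python
import os

# Assumed names of the variables launch.py is expected to export.
# Treat this list as an illustration, not an authoritative reference.
REQUIRED_DIST_ENV = (
    "DMLC_ROLE",          # role of this process (worker, server or scheduler)
    "DMLC_PS_ROOT_URI",   # address of the scheduler
    "DMLC_PS_ROOT_PORT",  # port of the scheduler
    "DMLC_NUM_SERVER",    # number of parameter servers
    "DMLC_NUM_WORKER",    # number of workers
)


def missing_dist_env(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_DIST_ENV if not env.get(name)]


# In a plain notebook none of these are usually set, so the distributed
# kvstore cannot be created there.
print(missing_dist_env({}))
```

A script could call `missing_dist_env()` first and print a hint to start it through `launch.py` when the returned list is not empty.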

I see. I just don't like the idea of pointing to a GitHub README for a tutorial, from a UI and UX perspective, and also having no test coverage on it. I'll let others weigh in on it, but if that's the only solution... I'm just not happy with it, considering what happened with the tutorial on the straight dope etc. I wonder if we could work out a specific test case where we run a distributed training in a bunch of docker containers.

If the concern is about test coverage, can we add distributed_training/cifar10_dist.py to nightly test suite, with some assertion of accuracy on validation set?

ThomasDelteil · 2018-05-16T18:31:37Z

example/distributed_training/README.md

+
+![Multiple GPUs connected to multiple hosts](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/example/distributed_training/distributed_training.png)
+
+We will use data parallelism to distribute the training which involves splitting the training data across GPUs attached to multiple hosts. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster.


a lot faster*

ThomasDelteil · 2018-05-16T18:32:36Z

example/distributed_training/README.md

+
+We will use data parallelism to distribute the training which involves splitting the training data across GPUs attached to multiple hosts. Since the hosts are working with different subset of the training data in parallel, the training completes lot faster.
+
+In this tutorial, we will train a LeNet network using MNIST dataset using two hosts each having four GPUs.


Could we use CIFAR10 instead? Multi-host, multi-GPU training is a bit overkill for MNIST, since it already trains in 2-3 seconds on a single GPU.

Valid point. I'll switch to CIFAR.

ThomasDelteil · 2018-05-16T18:35:06Z

example/distributed_training/README.md

+```python
+store = kv.create('dist')
+print("Total number of workers: %d" % store.num_workers)
+print("This worker's rank: %d" % store.rank)


it would be nice to have an example of the output of these functions in the .md file as well
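As a rough illustration of what such output could look like, the sketch below formats the two printed lines for a two-worker job without needing a cluster. It uses a `SimpleNamespace` as a stand-in for the kvstore and assumes ranks are numbered from 0; in a real job the values come from `kv.create('dist')`. The helpers `describe_worker` and `shard_indices` are hypothetical, and the strided split is only one common way to give each worker a different subset of the data, not necessarily what the tutorial does.

```python
from types import SimpleNamespace


def describe_worker(store):
    """Format the two lines the tutorial snippet prints for one worker."""
    return [
        "Total number of workers: %d" % store.num_workers,
        "This worker's rank: %d" % store.rank,
    ]


def shard_indices(num_samples, num_workers, rank):
    """Return the sample indices one worker trains on (strided split)."""
    return list(range(rank, num_samples, num_workers))


# Stand-ins for the kvstore seen by each process of a 2-worker job.
stores = [SimpleNamespace(num_workers=2, rank=r) for r in range(2)]

for store in stores:
    print("\n".join(describe_worker(store)))
    # Each worker picks a disjoint subset of a toy 10-sample dataset.
    print("Sample indices:", shard_indices(10, store.num_workers, store.rank))
```

Each worker process prints only its own pair of lines; the loop above simply shows both workers side by side.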

ThomasDelteil · 2018-05-16T18:37:39Z

example/distributed_training/README.md

+
+- `-n 2` specifies the number of workers that must be launched
+- `-s 2` specifies the number of parameter servers that must be launched.
+- `--sync-dst-dir` specifies a destination location where the contents of the current directory with be rsync'd


with be => will be

ThomasDelteil · 2018-05-16T18:37:59Z

example/distributed_training/README.md

+- `-n 2` specifies the number of workers that must be launched
+- `-s 2` specifies the number of parameter servers that must be launched.
+- `--sync-dst-dir` specifies a destination location where the contents of the current directory with be rsync'd
+- `--launcher ssh` tells `launch.py` to use ssh to login to each machine in the cluster and launch processes.


login to => login on ?
launch processes => launch the processes ?

indhub · 2018-06-15T20:47:24Z

I'm considering testing this tutorial as part of the nightly tests. While we might not have a distributed setup in the CI for testing this, it is possible to run all workers on the same machine. As long as there is enough memory on the GPU for two workers to run simultaneously, it should work, and that should be good enough for testing the tutorial.

I think something like this should work:

python launch.py -n 2 -s 2 --launcher local python cifar10_dist.py
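A nightly test could assemble that command programmatically. The sketch below only builds the argument list from the flags quoted in this thread (`-n`, `-s`, `--launcher`); the helper `build_launch_command` is hypothetical, and the call that would actually start the job is left commented out because it assumes `launch.py` and `cifar10_dist.py` are in the working directory.

```python
def build_launch_command(num_workers, num_servers, launcher, script):
    """Build the argv list for running `script` through launch.py."""
    return [
        "python", "launch.py",
        "-n", str(num_workers),   # number of workers to launch
        "-s", str(num_servers),   # number of parameter servers to launch
        "--launcher", launcher,   # "local" runs every process on this machine
        "python", script,
    ]


cmd = build_launch_command(2, 2, "local", "cifar10_dist.py")
print(" ".join(cmd))

# To run it for real (assumes both files exist in the current directory):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues if the script path ever contains spaces.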

eric-haibin-lin · 2018-07-02T21:14:31Z

example/distributed_training/README.md

@@ -0,0 +1,255 @@
+# Distributed Training using Gluon
+
+Deep learning models are usually trained using GPUs because GPUs can do a lot more computations in parallel that CPUs. But even with the modern GPUs, it could take several days to train big models. Training can be done faster by using multiple GPUs like described in [this](https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) tutorial. However only a certain number of GPUs can be attached to one host (typically 8 or 16). To make the training even faster, we can use multiple GPUs attached to multiple hosts.


that CPUs -> than CPUs

eric-haibin-lin · 2018-07-02T21:23:39Z

example/distributed_training/README.md

@@ -0,0 +1,231 @@
+# Distributed Training using Gluon


If the concern is about test coverage, can we add distributed_training/cifar10_dist.py to nightly test suite, with some assertion of accuracy on validation set?

sandeep-krishnamurthy · 2018-07-30T22:34:30Z

@indhub - Can you rebase the PR so we can get it merged? This tutorial is useful for users.

sandeep-krishnamurthy · 2018-08-21T05:07:16Z

@indhub - Thanks for rebasing. Can you please comment on testing this tutorial in the nightly tests?

eric-haibin-lin · 2018-08-26T16:35:00Z

Given that this is a very useful tutorial and that making distributed training tutorial testing work requires much more work, I'm merging this for now and have created an issue to track this: #12363

indhub requested a review from szha as a code owner May 15, 2018 19:34

ThomasDelteil suggested changes May 16, 2018

szha requested review from eric-haibin-lin and removed request for szha May 21, 2018 21:35

eric-haibin-lin reviewed Jul 2, 2018

thomelane mentioned this pull request Jul 12, 2018

How to train model with multi machines #9186

Closed

indhub and others added 20 commits August 15, 2018 06:43

* First draft (9ea10a6)
* Python syntax highlighting (9723c3f)
* Polishing (84b4417)
* Add distributed MNIST (a4f0c96)
* rename (286cbbd)
* Polishing (90f1ca3)
* Add images (c7f06be)
* Add images (8a1f051)
* Link to the example python file. Minor edits. (960eef5)
* Minor changes (b3859ea)
* Use images from web-data (ec5016c)
* Rename folder (a025b86)
* Remove images from example folder (f260f53)
* Add license header (de9a198)
* Use png image instead of svg (27335a3)
* Add distributed training tutorial to tutorials index (23966af)
* Use CIFAR-10 instead of MNIST. (5866c4e)
* Fix language errors (0d6cf2b)
* Add a sample output from distributed training (2207d5b)
* Add the output of store.num_workers and store.rank (8fcc69d)

eric-haibin-lin mentioned this pull request Aug 26, 2018

distributed training notebook tests #12363

Open

eric-haibin-lin merged commit 84665e3 into apache:master Aug 26, 2018
